incubator-tools/parsing_documentai_ json_outputs_with_jq/Parsing Document AI JSON Outputs with JQ.ipynb (222 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "id": "a944a237-82c9-42a0-a927-4c47c4348323", "metadata": {}, "source": [ "# Parsing Document AI JSON Outputs with JQ" ] }, { "cell_type": "markdown", "id": "e3447918-243b-40ea-82bd-62a18f6ea4f6", "metadata": {}, "source": [ "* Author: docai-incubator@google.com" ] }, { "cell_type": "markdown", "id": "dd73d843-3a7c-482b-bb04-831f9d5a25b3", "metadata": {}, "source": [ "## Disclaimer\n", "\n", "This documentation is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.\n" ] }, { "cell_type": "markdown", "id": "c80bb4ec-d6d1-405c-b932-0d6cd97120e0", "metadata": {}, "source": [ "## Objective\n", "\n", "JQ is a command-line tool designed for processing JSON data. It allows users to slice, filter, map, and transform structured data efficiently with its simple and intuitive syntax. This guide aims to effectively utilize JQ for analyzing JSON outputs from Google Document AI. By demonstrating practical examples.\n", "\n", "Note: The examples provided below are just a few possibilities. They can be customized based on your specific needs." ] }, { "cell_type": "markdown", "id": "f02fa54b-d8f5-45a5-b175-7bfc5c1e3c62", "metadata": {}, "source": [ "## Prerequisite\n", "\n", "JQ installed on your machine\n", "Document AI Json Files" ] }, { "cell_type": "markdown", "id": "7ac74080-8f03-45ed-aa3f-01b4a24b33c8", "metadata": {}, "source": [ "## Installation\n", "\n", "To begin with, you need to install JQ on your system. JQ is available for Linux, Windows, and MacOS, and the installation process is straightforward.\n", "\n", "* Linux: On most Linux distributions, JQ can be installed using the package manager. For example, on Ubuntu or Debian-based systems, you can install it using apt:" ] }, { "cell_type": "markdown", "id": "b27da0cf-3aae-409c-9692-edb3b6c82078", "metadata": {}, "source": [ "<img src = \"./Images/ubuntu_image.png\" width=800 height=400 alt=\"Installation in Ubuntu\"></img>" ] }, { "cell_type": "markdown", "id": "cfecbd93-8eac-41c1-9571-f606b6bdac00", "metadata": {}, "source": [ "* Windows: For Windows, download the executable from the JQ official website and add it to your system's PATH.\n", "* MacOS: On MacOS, you can use Homebrew to install JQ:" ] }, { "cell_type": "markdown", "id": "825ebf67-b8d8-403a-8d57-8072d62cc9e1", "metadata": {}, "source": [ "<img src = \"./Images/macos.png\" width=800 height=400 alt=\"macos installation\"></img>" ] }, { "cell_type": "markdown", "id": "c4ffb825-88ef-4f73-8e85-8a499d0335ab", "metadata": {}, "source": [ "After installation, you can verify that JQ is correctly installed by running jq --version in your command line or terminal. This command should return the version of JQ installed on your system." ] }, { "cell_type": "markdown", "id": "6aff14b5-bd8b-4b9c-8cc8-3a51316c4807", "metadata": {}, "source": [ "## Usage\n", "\n", "## Extract JSON entities - type, mention text, confidence\n", "\n", "This JQ command can be executed in the terminal or command prompt to extract and display the type, mentionText, and confidence from each entity in a JSON file, with the filename adjusted as necessary.\n" ] }, { "cell_type": "markdown", "id": "eb51abc2-9bf2-4efb-b29d-77b614e1b0bc", "metadata": {}, "source": [ "## Command" ] }, { "cell_type": "markdown", "id": "23533571-cea8-4d66-934c-04546435114b", "metadata": {}, "source": [ "<b><h4>jq '.entities[] | \"\\(.type) | \\(.mentionText) | \\(.confidence)\"' docai-output.json</b>" ] }, { "cell_type": "markdown", "id": "aef39833-4fb4-444a-a1a3-ee7e63a4ae4f", "metadata": {}, "source": [ "## Output\n", "<img src=\"./Images/jq_output_1.png\" width=800 height=400 alt='output1'></img>" ] }, { "cell_type": "markdown", "id": "cfede701-29fd-4355-82c4-f6211df35748", "metadata": {}, "source": [ "## Extract JSON entities with child entities - type, mention text, confidence\n", "\n", "This JQ command can be run in the terminal or command prompt to extract and display the type, mentionText, and confidence from each entity's properties in a JSON file, using a default empty array when properties are missing, with the filename adjusted as necessary." ] }, { "cell_type": "markdown", "id": "6d46db1b-8069-4bb3-8b4b-dc3622b1d6fc", "metadata": {}, "source": [ "## Command\n", "\n", "<b><h4>jq '.entities[] | .properties // [.] | .[] | \"\\(.type) | \\(.mentionText) | \\(.confidence)\"' docai-output.json**" ] }, { "cell_type": "markdown", "id": "e518993c-46b8-4004-8434-f9f8e188bddd", "metadata": {}, "source": [ "## Output\n", "<img src=\"./Images/jq_output_image_2.png\" width=800 height=400 alt='output2'></img>" ] }, { "cell_type": "markdown", "id": "7c2f2674-c485-44f4-9a05-c94212a1e6aa", "metadata": {}, "source": [ "## Export to CSV\n", "\n", "This command can be executed in the terminal or command prompt to create a CSV file named output.csv with headers, and then append extracted type, mentionText, and confidence from each entity in a JSON file. Adjust the filename as necessary." ] }, { "cell_type": "markdown", "id": "b2d61762-295c-4189-b450-949a25cf1a5a", "metadata": {}, "source": [ "## Command\n", "\n", "<b><h4>echo \"type,mentionText,confidence\" > output.csv\n", "\n", "<b><h4>jq -r '.entities[] | .properties // [.] | .[] | [.type, .mentionText, .confidence] | @csv' docai-output.json >> output.csv" ] }, { "cell_type": "markdown", "id": "7d1722f8-b0c1-49da-991c-9126ab167030", "metadata": {}, "source": [ "## Output\n", "<img src=\"./Images/csv_output.png\" width=800 height=400 alt='csv_output'></img>" ] } ], "metadata": { "environment": { "kernel": "python3", "name": "common-cpu.m112", "type": "gcloud", "uri": "gcr.io/deeplearning-platform-release/base-cpu:m112" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }